{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Synthetic data generation for fine-tuning custom retrieval and reranking models\n",
    "\n",
    "- **Goal**: Bootstrap, optimize and maintain your embedding models and rerankers through synthetic data generation and human feedback.\n",
    "- **Libraries**: [argilla](https://github.com/argilla-io/argilla), [hf-inference-endpoints](https://github.com/huggingface/huggingface_hub), [sentence-transformers](https://github.com/UKPLab/sentence-transformers)\n",
    "- **Components**: [LoadDataFromHub](https://distilabel.argilla.io/latest/components-gallery/steps/loaddatafromhub/), [GenerateSentencePair](https://distilabel.argilla.io/latest/components-gallery/tasks/generatesentencepair/), [InferenceEndpointsLLM](https://distilabel.argilla.io/latest/components-gallery/llms/inferenceendpointsllm/)\n",
    "\n",
    "![GenerateSentencePair pipeline overview](../../../assets/pipelines/sentence-transformer.png)\n",
    "\n",
    "!!! note\n",
    "    For a comprehensive overview on optimizing the retrieval performance in a RAG pipeline, check this [guide](https://docs.zenml.io/user-guide/llmops-guide/finetuning-embeddings) in collaboration with [ZenML](https://github.com/zenml-io/zenml), an open-source MLOps framework designed for building portable and production-ready machine learning pipelines."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Getting started\n",
    "\n",
    "### Install the dependencies\n",
    "\n",
    "To complete this tutorial, you need to install the distilabel SDK and a few third-party libraries via pip. We will be using **the free but rate-limited Hugging Face serverless Inference API** for this tutorial, so we need to install this as an extra distilabel dependency. You can install them by running the following command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install \"distilabel[hf-inference-endpoints]\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install \"sentence-transformers~=3.0\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's make the needed imports:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from distilabel.models import InferenceEndpointsLLM\n",
    "from distilabel.pipeline import Pipeline\n",
    "from distilabel.steps.tasks import GenerateSentencePair\n",
    "from distilabel.steps import LoadDataFromHub\n",
    "\n",
    "from sentence_transformers import SentenceTransformer, CrossEncoder\n",
    "import torch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You'll need an `HF_TOKEN` to use the HF Inference Endpoints. Login to use it directly within this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from huggingface_hub import login\n",
    "\n",
    "login(token=os.getenv(\"HF_TOKEN\"), add_to_git_credential=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (optional) Deploy Argilla\n",
    "\n",
    "You can skip this step or replace it with any other data evaluation tool, but the quality of your model will suffer from a lack of data quality, so we do recommend looking at your data. If you already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy Argilla following [this guide](https://docs.argilla.io/latest/getting_started/quickstart/). \n",
    "\n",
    "Along with that, you will need to install Argilla as a distilabel extra."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install \"distilabel[argilla, hf-inference-endpoints]\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's make the extra needed imports:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "import argilla as rg"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The dataset\n",
    "\n",
    "Before starting any project, it is always important to look at your data. Our data is publicly available [on the Hugging Face Hub](https://huggingface.co/datasets/plaguss/argilla_sdk_docs_raw_unstructured?row=0) so we can have a quick look through [their dataset viewer within an embedded iFrame](https://huggingface.co/docs/hub/datasets-viewer-embed). \n",
    "\n",
    "<iframe src=\"https://huggingface.co/datasets/plaguss/argilla_sdk_docs_raw_unstructured/embed/viewer\" frameborder=\"0\" width=\"100%\" height=\"560px\"></iframe>\n",
    "\n",
    "As we can see, our dataset contains a column called `chunks`, which was obtained from the Argilla docs. Normally, you would need to download and chunk the data but we will not cover that in this tutorial. To read a full explanation for how this dataset was generated, please refer to [How we leveraged distilabel to create an Argilla 2.0 Chatbot](https://huggingface.co/blog/argilla-chatbot#downloading-and-chunking-data).\n",
    "\n",
    "Alternatively, we can load the entire dataset to disk with `datasets.load_dataset`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Synthetic data generation\n",
    "\n",
    "The [`GenerateSentencePair`](https://distilabel.argilla.io/latest/components-gallery/tasks/generatesentencepair/) component from `distilabel` can be used to generate training datasets for embeddings models. \n",
    "\n",
    "It is a pre-defined `Task` that given an `anchor` sentence generate data for a specific `action`. Supported actions are: `\"paraphrase\", \"semantically-similar\", \"query\", \"answer\"`. In our case the `chunks` column corresponds to the `anchor`. This means we will use `query` to generate potential queries for a fine-tuning a retrieval model and that we will use `semantically-similar` to generate texts that are similar to the intial anchor for fine-tuning a reranking model.\n",
    "\n",
    "We will `triplet=True` in order to generate both positive and negative examples, which should help the model generalize better during fine-tuning and we will set `hard_negative=True` to generate more challenging examples that are closer to the anchor and discussed topics.\n",
    "\n",
    "Lastly, we can seed the LLM with `context` to generate more relevant examples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "context = (\n",
    "\"\"\"\n",
    "The text is a chunk from technical Python SDK documentation of Argilla.\n",
    "Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets.\n",
    "Along with prose explanations, the text chunk may include code snippets and Python references.\n",
    "\"\"\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Retrieval\n",
    "\n",
    "For retrieval, we will thus generate queries that are similar to the `chunks` column. We will use the `query` action to generate potential queries for a fine-tuning a retrieval model.\n",
    "\n",
    "```python\n",
    "generate_sentence_pair = GenerateSentencePair(\n",
    "    triplet=True,  \n",
    "    hard_negative=True,\n",
    "    action=\"query\",\n",
    "    llm=llm,\n",
    "    input_batch_size=10,\n",
    "    context=context,\n",
    ")\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reranking\n",
    "\n",
    "For reranking, we will generate texts that are similar to the intial anchor. We will use the `semantically-similar` action to generate texts that are similar to the intial anchor for fine-tuning a reranking model. In this case, we set `hard_negative=False` to generate more diverse and potentially wrong examples, which can be used as negative examples for similarity fine-tuning because [rerankers cannot be fine-tuned using triplets](https://github.com/UKPLab/sentence-transformers/issues/2366).\n",
    "\n",
    "```python\n",
    "generate_sentence_pair = GenerateSentencePair(\n",
    "    triplet=True,\n",
    "    hard_negative=False,\n",
    "    action=\"semantically-similar\",\n",
    "    llm=llm,\n",
    "    input_batch_size=10,\n",
    "    context=context,\n",
    ")\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Combined pipeline\n",
    "\n",
    "We will now use the `GenerateSentencePair` task to generate synthetic data for both retrieval and reranking models in a single pipeline. Note that, we map the `chunks` column to the `anchor` argument."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "llm = InferenceEndpointsLLM(\n",
    "    model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n",
    "    tokenizer_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n",
    ")\n",
    "\n",
    "with Pipeline(name=\"generate\") as pipeline:\n",
    "    load_dataset = LoadDataFromHub(\n",
    "        num_examples=15,\n",
    "        output_mappings={\"chunks\": \"anchor\"},\n",
    "    )\n",
    "    generate_retrieval_pairs = GenerateSentencePair(\n",
    "        name=\"generate_retrieval_pairs\",\n",
    "        triplet=True,\n",
    "        hard_negative=True,\n",
    "        action=\"query\",\n",
    "        llm=llm,\n",
    "        input_batch_size=10,\n",
    "        context=context,\n",
    "    )\n",
    "    generate_reranking_pairs = GenerateSentencePair(\n",
    "        name=\"generate_reranking_pairs\",\n",
    "        triplet=True,\n",
    "        hard_negative=False,  # to potentially generate non-relevant pairs\n",
    "        action=\"semantically-similar\",\n",
    "        llm=llm,\n",
    "        input_batch_size=10,\n",
    "        context=context,\n",
    "    )\n",
    "\n",
    "    load_dataset.connect(generate_retrieval_pairs, generate_reranking_pairs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we can execute this using `pipeline.run`. We will provide some `parameters` to specific components within our pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "generation_kwargs = {\n",
    "    \"llm\": {\n",
    "        \"generation_kwargs\": {\n",
    "            \"temperature\": 0.7,\n",
    "            \"max_new_tokens\": 512,\n",
    "        }\n",
    "    }\n",
    "}\n",
    "\n",
    "distiset = pipeline.run(  \n",
    "    parameters={\n",
    "        load_dataset.name: {\n",
    "            \"repo_id\": \"plaguss/argilla_sdk_docs_raw_unstructured\",\n",
    "            \"split\": \"train\",\n",
    "        },\n",
    "        generate_retrieval_pairs.name: generation_kwargs,\n",
    "        generate_reranking_pairs.name: generation_kwargs,\n",
    "    },\n",
    "    use_cache=False,  # False for demo\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Data generation can be a expensive, so it is recommended to store the data somewhere. For now, we will store it on the Hugging Face Hub, using  our `push_to_hub` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "distiset.push_to_hub(\"[your-owner-name]/example-retrieval-reranking-dataset\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have got 2 different leaf/end nodes, therefore we've got a distil configurations we can access, one for the retrieval data, and one for the reranking data.\n",
    "\n",
    "<iframe\n",
    "  src=\"https://huggingface.co/datasets/distilabel-internal-testing/example-retrieval-reranking-dataset/embed/viewer/generate_reranking_pairs/train\"\n",
    "  frameborder=\"0\"\n",
    "  width=\"100%\"\n",
    "  height=\"560px\"\n",
    "></iframe>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looking at these initial examples, we can see they nicely capture the essence of the `chunks` column but we will need to evaluate the quality of the data a bit more before we can use it for fine-tuning."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data quality evaluation \n",
    "\n",
    "Data is never as clean as it can be and this also holds for synthetically generated data too, therefore, it is always good to spent some time and look at your data.\n",
    "\n",
    "### Feature engineering\n",
    "\n",
    "In order to evaluate the quality of our data we will use features of the  models that we intent to fine-tune as proxy for data quality. We can then use these features to filter out the best examples.\n",
    "\n",
    "In order to choose a good default model, we will use the [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard). We want to optimize for size and speed, so we will set model size `<100M` and then filter for `Retrieval` and `Reranking` based on the highest average score, resulting in [Snowflake/snowflake-arctic-embed-s](https://huggingface.co/Snowflake/snowflake-arctic-embed-s) and [sentence-transformers/all-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2) respectively.\n",
    "\n",
    "<iframe\n",
    "\tsrc=\"https://mteb-leaderboard.hf.space\"\n",
    "\tframeborder=\"0\"\n",
    "\twidth=\"100%\"\n",
    "\theight=\"600\"\n",
    "></iframe>\n",
    "\n",
    "#### Retrieval\n",
    "\n",
    "For retrieval, we will compute similarities for the current embeddings of `anchor-positive`, `positive-negative` and `anchor-negative` pairs. We assume that an overlap of these similarities will cause the model to have difficulties generalizing and therefore we can use these features to evaluate the quality of our data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_id = \"Snowflake/snowflake-arctic-embed-m\"  # Hugging Face model ID\n",
    "\n",
    "model_retrieval = SentenceTransformer(\n",
    "    model_id, device=\"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we will encode the generated text pairs and compute the similarities. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics.pairwise import cosine_similarity\n",
    "\n",
    "def get_embeddings(texts):\n",
    "    vectors = model_retrieval.encode(texts)\n",
    "    return [vector.tolist() for vector in vectors]\n",
    "\n",
    "\n",
    "def get_similarities(vector_batch_a, vector_batch_b):\n",
    "    similarities = []\n",
    "    for vector_a, vector_b in zip(vector_batch_a, vector_batch_b):\n",
    "        similarity = cosine_similarity([vector_a], [vector_b])[0][0]\n",
    "        similarities.append(similarity)\n",
    "    return similarities\n",
    "\n",
    "def format_data_retriever(batch):# -> Any:\n",
    "    batch[\"anchor-vector\"] = get_embeddings(batch[\"anchor\"])\n",
    "    batch[\"positive-vector\"] = get_embeddings(batch[\"positive\"])\n",
    "    batch[\"negative-vector\"] = get_embeddings(batch[\"negative\"])    \n",
    "    batch[\"similarity-positive-negative\"] = get_similarities(batch[\"positive-vector\"], batch[\"negative-vector\"])\n",
    "    batch[\"similarity-anchor-positive\"] = get_similarities(batch[\"anchor-vector\"], batch[\"positive-vector\"])\n",
    "    batch[\"similarity-anchor-negative\"] = get_similarities(batch[\"anchor-vector\"], batch[\"negative-vector\"])\n",
    "    return batch\n",
    "\n",
    "dataset_generate_retrieval_pairs = distiset[\"generate_retrieval_pairs\"][\"train\"].map(format_data_retriever, batched=True, batch_size=250)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "#### Reranking\n",
    "\n",
    "For reranking, we will compute the compute the relevance scores from an existing reranker model for `anchor-positive`, `positive-negative` and `anchor-negative` pais and make a similar assumption as for the retrieval model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_id = \"sentence-transformers/all-MiniLM-L12-v2\"\n",
    "\n",
    "model = CrossEncoder(model_id)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we will compute the similarity for the generated text pairs using the reranker. On top of that, we will compute an `anchor-vector` to allow for doing semantic search."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def format_data_retriever(batch):# -> Any:\n",
    "    batch[\"anchor-vector\"] = get_embeddings(batch[\"anchor\"])\n",
    "    batch[\"similarity-positive-negative\"] = model.predict(zip(batch[\"positive-vector\"], batch[\"negative-vector\"]))\n",
    "    batch[\"similarity-anchor-positive\"] = model.predict(zip(batch[\"anchor-vector\"], batch[\"positive-vector\"]))\n",
    "    batch[\"similarity-anchor-negative\"] = model.predict(zip(batch[\"anchor-vector\"], batch[\"negative-vector\"]))\n",
    "    return batch\n",
    "\n",
    "dataset_generate_reranking_pairs = distiset[\"generate_reranking_pairs\"][\"train\"].map(format_data_retriever, batched=True, batch_size=250)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And voila, we have our proxies for quality evaluation which we can use to filter out the best and worst examples."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (Optional) Argilla\n",
    "\n",
    "To get the most out of you data and actually look at our data, we will use Argilla. If you are not familiar with Argilla, we recommend taking a look at the [Argilla quickstart docs](https://docs.argilla.io/latest/getting_started/quickstart/). Alternatively, you can use your Hugging Face account to login to the [Argilla demo Space](https://argilla-argilla-template-space.hf.space).\n",
    "\n",
    "To start exploring data, we first need to define an `argilla.Dataset`. We will create a basic datset with some input `TextFields` for the `anchor` and output `TextQuestions` for the `positive` and `negative` pairs. Additionally, we will use the `file_name` as `MetaDataProperty`. Lastly, we will be re-using the vectors obtained from our previous step to allow for semantic search and we will add te similarity scores for some basic filtering and sorting."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we need to define the setting for our Argilla dataset. We will create two different datasets, one for the retrieval data and one for the reranking data to ensure our annotators can focus on the task at hand."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import argilla as rg\n",
    "from argilla._exceptions import ConflictError\n",
    "\n",
    "api_key = \"ohh so secret\"\n",
    "api_url = \"https://[your-owner-name]-[your-space-name].hf.space\"\n",
    "\n",
    "client = rg.Argilla(api_url=api_url, api_key=api_key)\n",
    "\n",
    "settings = rg.Settings(\n",
    "    fields=[\n",
    "        rg.TextField(\"anchor\")\n",
    "    ],\n",
    "    questions=[\n",
    "        rg.TextQuestion(\"positive\"),\n",
    "        rg.TextQuestion(\"negative\"),\n",
    "        rg.LabelQuestion(\n",
    "            name=\"is_positive_relevant\",\n",
    "            title=\"Is the positive query relevant?\",\n",
    "            labels=[\"yes\", \"no\"],\n",
    "        ),\n",
    "        rg.LabelQuestion(\n",
    "            name=\"is_negative_irrelevant\",\n",
    "            title=\"Is the negative query irrelevant?\",\n",
    "            labels=[\"yes\", \"no\"],\n",
    "        )\n",
    "    ],\n",
    "    metadata=[\n",
    "        rg.TermsMetadataProperty(\"filename\"),\n",
    "        rg.FloatMetadataProperty(\"similarity-positive-negative\"),\n",
    "        rg.FloatMetadataProperty(\"similarity-anchor-positive\"),\n",
    "        rg.FloatMetadataProperty(\"similarity-anchor-negative\"),\n",
    "    ],\n",
    "    vectors=[\n",
    "        rg.VectorField(\"anchor-vector\", dimensions=model.get_sentence_embedding_dimension())\n",
    "    ]\n",
    ")\n",
    "rg_datasets = []\n",
    "for dataset_name in [\"generate_retrieval_pairs\", \"generate_reranking_pairs\"]:\n",
    "    ds = rg.Dataset(\n",
    "        name=dataset_name,\n",
    "        settings=settings\n",
    "    )\n",
    "    try:\n",
    "        ds.create()\n",
    "    except ConflictError:\n",
    "        ds = client.datasets(dataset_name)\n",
    "    rg_datasets.append(ds)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we've got our dataset definitions setup in Argilla, we can upload our data to Argilla."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ds_datasets = [dataset_generate_retrieval_pairs, dataset_generate_reranking_pairs]\n",
    "\n",
    "records = []\n",
    "\n",
    "for rg_dataset, ds_dataset in zip(rg_datasets, ds_datasets):\n",
    "    for idx, entry in enumerate(ds_dataset):\n",
    "        records.append(\n",
    "            rg.Record(\n",
    "                id=idx,\n",
    "                fields={\"anchor\": entry[\"anchor\"]},\n",
    "                suggestions=[\n",
    "                    rg.Suggestion(\"positive\", value=entry[\"positive\"], agent=\"gpt-4o\", type=\"model\"),\n",
    "                    rg.Suggestion(\"negative\", value=entry[\"negative\"], agent=\"gpt-4o\", type=\"model\"),\n",
    "                ],\n",
    "                metadata={\n",
    "                    \"filename\": entry[\"filename\"],\n",
    "                    \"similarity-positive-negative\": entry[\"similarity-positive-negative\"],\n",
    "                    \"similarity-anchor-positive\": entry[\"similarity-anchor-positive\"],\n",
    "                    \"similarity-anchor-negative\": entry[\"similarity-anchor-negative\"]\n",
    "                },\n",
    "                vectors={\"anchor-vector\": entry[\"anchor-vector\"]}\n",
    "            )\n",
    "        )\n",
    "    rg_dataset.records.log(records)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we can explore the UI and add a final human touch to get the most out of our dataset. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fine-tuning\n",
    "\n",
    "At last, we can fine-tune our models. We will use the `sentence-transformers` library to fine-tune our models.\n",
    "\n",
    "### Retrieval\n",
    "\n",
    "For retrieval, we have created a script that fine-tunes a model on our generated data the generated data based [https://github.com/argilla-io/argilla-sdk-chatbot/blob/main/train_embedding.ipynb](https://github.com/argilla-io/argilla-sdk-chatbot/blob/main/train_embedding.ipynb).You can also [open it in Google Colab directly](https://githubtocolab.com/argilla-io/argilla-sdk-chatbot/blob/main/train_embedding.ipynb)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reranking\n",
    "\n",
    "For reranking, `sentence-transformers` provides a script that shows [how to fine-tune a CrossEncoder models](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/cross-encoder). Ad of now, there is [some uncertainty over fine-tuning CrossEncoder models with triplets](https://github.com/UKPLab/sentence-transformers/issues/2366) but you can still use the `positive` and `anchor`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusions\n",
    "\n",
    "In this tutorial, we present an end-to-end example of fine-tuning retrievers and rerankers for RAG. This serves as a good starting point for optimizing and maintaining your data and model but need to be adapted to your specific use case.\n",
    "\n",
    "We started with some seed data from the Argilla docs, generated synthetic data for retrieval and reranking models, evaluated the quality of the data, and showed how to fine-tune the models. We also used Argilla to get a human touch on the data."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".env",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
